FIGURE 5.15
Overview of BiBERT, applying the Bi-Attention structure to maximize representation information and the Direction-Matching Distillation (DMD) scheme for accurate optimization.
TABLE 5.7
Quantization results of BiBERT on the GLUE benchmark. The average results over all tasks are reported.

Method         #Bits        Size (MB)   GLUE
BERT-base      full-prec.   418         82.84
BinaryBERT     1-1-4        16.5        79.9
TernaryBERT    2-2-2        28.0        45.5
BinaryBERT     1-1-2        16.5        53.7
TernaryBERT    2-2-1        28.0        42.3
BinaryBERT     1-1-1        16.5        41.0
BiBERT         1-1-1        13.4        63.2
BERT-base6L    full-prec.   257         79.4
BiBERT6L       1-1-1        6.8         62.1
BERT-base4L    full-prec.   55.6        77.0
BiBERT4L       1-1-1        4.4         57.7
In summary, the contributions of this work can be summarized as follows: (1) it is the first work to explore fully binarized pre-trained BERT models; (2) it proposes an efficient Bi-Attention structure that maximizes representation information statistically; and (3) it introduces a Direction-Matching Distillation (DMD) scheme that optimizes the fully binarized BERT accurately.
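To make these two components concrete, the following is a minimal PyTorch-style sketch of a Bi-Attention forward pass and a DMD-style distillation loss, written under simplifying assumptions: queries, keys, and values are taken to be binarized upstream, the straight-through helper ste_bool and all tensor names are illustrative, and the distillation targets are built here as similarity ("pattern") matrices of ℓ2-normalized Q/K/V. This is a sketch of the idea, not the authors' released implementation.

    import torch
    import torch.nn.functional as F

    def ste_bool(x):
        # Forward: bool(x) in {0, 1}; backward: straight-through estimator
        # that passes the incoming gradient through unchanged (illustrative helper).
        hard = (x >= 0).float()
        return (hard - x).detach() + x

    def bi_attention(bq, bk, bv):
        # bq, bk, bv: binarized query/key/value, shape (batch, heads, seq, head_dim).
        # bool(.) replaces the softmax-then-sign pipeline so that the binary
        # attention weights retain high information entropy, which is the goal
        # of BiBERT's Bi-Attention structure.
        d = bq.shape[-1]
        scores = bq @ bk.transpose(-2, -1) / d ** 0.5
        ba = ste_bool(scores)           # binary attention weights in {0, 1}
        return ba @ bv

    def dmd_loss(q_s, k_s, v_s, q_t, k_t, v_t):
        # Direction-Matching Distillation (sketch): compare similarity matrices
        # built from L2-normalized Q/K/V of the student (s) and teacher (t), so
        # only the direction of the representations is matched, not the magnitude.
        def pattern(x):
            x = F.normalize(x, dim=-1)
            return x @ x.transpose(-2, -1)
        loss = 0.0
        for s, t in ((q_s, q_t), (k_s, k_t), (v_s, v_t)):
            loss = loss + F.mse_loss(pattern(s), pattern(t))
        return loss

In training, the binarized student would produce its attention output through bi_attention, while dmd_loss would be added to the task loss so that the binarized query, key, and value representations follow the direction of their full-precision counterparts.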
5.10 BiT: Robustly Binarized Multi-Distilled Transformer
Liu et al. [156] further presented BiT to boost the performance of fully binarized pre-trained BERT models. In their work, they identified a series of improvements that enable binary BERT, including a two-set binarization scheme, an elastic binary activation function with learned parameters, and a method to quantize a network to its limit by successively